Voice Conversion (VC) is the task of making an utterance spoken by one speaker sound as if it were uttered by a different speaker, while keeping other aspects, such as content, unchanged. Current VC methods focus primarily on spectral features like timbre and ignore each person's unique speaking style, which often affects prosody. In this study, we introduce a method for converting not only the timbre but also the prosodic information (i.e., rhythm and pitch changes) to those of the target speaker. The proposed approach is based on a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and easy to optimise. We consider the many-to-many setting with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that the proposed approach is significantly superior to the evaluated baselines. Code and samples can be found at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/ .
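The abstract does not say how rhythm is represented on top of the discrete units. One common way to separate rhythm from content in a unit sequence is to collapse consecutive repeats into (unit, duration) pairs; the sketch below illustrates that idea only and is not taken from the authors' code. The function names, the example unit IDs, and the "target" durations are all hypothetical.

```python
from itertools import groupby


def collapse_units(units):
    """Split a frame-level unit sequence into content units and run lengths.

    Assumption: rhythm is approximated by how many consecutive frames
    each unit occupies. This is an illustrative choice, not the paper's
    stated method.
    """
    content, durations = [], []
    for unit, run in groupby(units):
        content.append(unit)
        durations.append(len(list(run)))
    return content, durations


def expand_units(content, durations):
    """Rebuild a frame-level sequence from content units and durations."""
    return [u for u, d in zip(content, durations) for _ in range(d)]


# Hypothetical frame-level unit IDs for one utterance.
frames = [7, 7, 7, 12, 12, 3, 3, 3, 3]
content, durations = collapse_units(frames)

# Hypothetical durations standing in for a target speaker's rhythm.
retimed = expand_units(content, [2, 3, 2])
```

With this split, the content sequence stays fixed while the durations can be replaced, which is one way a rhythm conversion step could operate on discrete units.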
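The discovery that universal adversarial perturbations exist has had a large theoretical and practical impact on the field of adversarial learning. In the text domain, most universal studies have focused on adversarial prefixes that are added to all texts. However, unlike in the vision domain, adding the same perturbation to different inputs results in noticeably unnatural inputs. We therefore introduce a new universal adversarial setup, a universal adversarial policy, which has many of the advantages of other universal attacks but also results in valid texts, making it relevant in practice. We achieve this by learning a single search policy, on many texts, over a predefined set of semantics-preserving text alterations. This formulation is universal in that the policy succeeds in finding adversarial examples on new texts. Our approach uses text perturbations that have been extensively shown to produce natural attacks in the non-universal setup (specific synonym replacements). We suggest a strong baseline approach for this formulation that uses reinforcement learning. Its ability to generalise (from as few as 500 training texts) shows that universal adversarial patterns exist in the text domain as well.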
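The setup described above, one policy reused across many texts and choosing among synonym replacements, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the synonym table, the stand-in classifier, and the hand-written `first_action_policy` are hypothetical, and the policy is a fixed rule rather than one learned with reinforcement learning as in the abstract.

```python
# Hypothetical synonym table defining the allowed text alterations.
SYNONYMS = {"good": ["fine", "decent"], "bad": ["poor"], "movie": ["film"]}


def toy_classifier(tokens):
    """Stand-in classifier: 'pos' iff 'good' appears more often than 'bad'."""
    score = tokens.count("good") - tokens.count("bad")
    return "pos" if score > 0 else "neg"


def candidate_actions(tokens):
    """List every (position, replacement) pair allowed by the synonym table."""
    return [(i, s) for i, t in enumerate(tokens) for s in SYNONYMS.get(t, [])]


def first_action_policy(tokens, actions):
    """Placeholder policy: always take the first available action.

    A learned policy would score the actions instead; this fixed rule
    only marks where such a policy would plug in.
    """
    return actions[0]


def apply_policy(tokens, policy, classify, max_steps=3):
    """Apply the same policy step by step until the predicted label flips."""
    tokens = list(tokens)
    original = classify(tokens)
    for _ in range(max_steps):
        actions = candidate_actions(tokens)
        if not actions:
            break
        position, replacement = policy(tokens, actions)
        tokens[position] = replacement
        if classify(tokens) != original:
            return tokens, True
    return tokens, False


# The same policy object is reused on every text, which is the "universal" part.
texts = [["a", "good", "movie"], ["bad", "movie"]]
results = [apply_policy(t, first_action_policy, toy_classifier) for t in texts]
```

Because every action is a synonym replacement, each intermediate text remains a plausible sentence, in contrast to prepending one fixed adversarial prefix to all inputs.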